(a) Simulate samples of size n = 100, ..., 100000 (with step, say, 1000) from a normal distribution with mean 1 and variance 1, i.e. N(1,1). For each sample estimate the mean and the variance. Plot the path of sample means as a function of n. What conclusion can we draw from the figure if we keep in mind the law of large numbers?
import numpy as np
import pandas as pd
import altair as alt
alt.renderers.enable('notebook')
np.random.seed(1)
mu, sigma = 1, 1 # mean and standard deviation
list_of_n = [i*100 for i in range(1,10)]
list_of_n.extend([i*1000 for i in range(1,10)])
list_of_n.extend([i*10000 for i in range(1,11)])
samples = [np.random.normal(mu, sigma, n) for n in list_of_n]
means = [sample.mean() for sample in samples]
variances = [sample.var() for sample in samples]
data = pd.DataFrame({'n':list_of_n, 'mean':means, 'variance':variances})
data.transpose()
data['1 value'] = [1]*len(list_of_n)
line = alt.Chart(data).mark_line().encode(
x=alt.X('n:O', ),
y=alt.Y('mean:Q',title ="mean", scale = alt.Scale(domain=[0.9,1.15])),
tooltip = alt.Tooltip(['n:O', 'mean:Q'])
)
line2 = alt.Chart(data).mark_line(color='#d53e4f').encode(
x=alt.X('n:O', ),
y=alt.Y('1 value:Q',title ="", scale = alt.Scale(domain=[0.9,1.15]))
)
alt.layer(line + line2).configure_axisX(
labelAngle=0,
labelPadding=8
).properties(width=900, title="The mean values for different n")
line = alt.Chart(data).mark_line(color="green").encode(
x=alt.X('n:O', ),
y=alt.Y('variance:Q',title ="variance", scale = alt.Scale(domain=[0.75,1.1])),
tooltip = alt.Tooltip(['n:O', 'variance:Q'])
)
line2 = alt.Chart(data).mark_line(color='#d53e4f').encode(
x=alt.X('n:O', ),
y=alt.Y('1 value:Q',title ="", scale = alt.Scale( domain=[0.75,1.1]))
)
alt.layer(line + line2).configure_axisX(
labelAngle=0,
labelPadding=8
).properties(width=900, title="The values of variance for different n")
Law of large numbers
$\lim_{n\to\infty}P(|\bar{X}_n-\mu|\geq c) = 0$ for every $c>0$
As we can see on the plots, the estimated means converge to the theoretical mean as n increases, exactly as the law of large numbers predicts.
The same holds for the variances: the estimated variances converge to the theoretical variance as n increases.
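The statement can also be checked directly by simulation. A minimal sketch, independent of the plots above (`deviation_freq` is a helper introduced here for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.0, 1.0

def deviation_freq(n, c=0.05, runs=200):
    """Fraction of `runs` samples of size n whose mean misses mu by >= c."""
    sample_means = rng.normal(mu, sigma, size=(runs, n)).mean(axis=1)
    return np.mean(np.abs(sample_means - mu) >= c)

# By the LLN the frequency of large deviations should shrink with n.
print(deviation_freq(100), deviation_freq(10000))
```

For n = 100 the sample mean still misses $\mu$ by 0.05 fairly often; for n = 10000 it essentially never does.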
(b) Frequently it is difficult to obtain more data. How many observations do we need in order to obtain an estimator which is close enough (±0.01) to the true value?
The exact Confidence Interval for $\mu$ at the level $1-\alpha$:
$$ \left[ \bar{X} - z_{1- \frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}, \bar{X} + z_{1- \frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}} \right] $$
So we want the half-width of this interval, $z_{1-\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}$, to be at most 0.01.
If $\alpha=0.05 \Rightarrow 1-\frac{\alpha}{2}=0.975$
import scipy.stats
Z_1_alfa2 = scipy.stats.norm.ppf(0.975)
Z_1_alfa2
$z_{1-\frac{\alpha}{2}} \approx 1.96,~~ \sigma=1$
np.power(100*1*Z_1_alfa2, 2)  # n >= (z * sigma / 0.01)^2 = (100 * sigma * z)^2
$\Rightarrow$ We need at least 38415 observations in order to obtain an estimator that is within ±0.01 of the true value.
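The same calculation written out as a short sketch (solving $z_{1-\alpha/2}\,\sigma/\sqrt{n} \le 0.01$ for $n$):

```python
import numpy as np
from scipy.stats import norm

sigma, margin = 1.0, 0.01
z = norm.ppf(0.975)                               # two-sided 95% quantile, ~1.96
n_required = int(np.ceil((z * sigma / margin) ** 2))
print(n_required)  # 38415
```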
(c) Add to the plot the 95% confidence intervals. Do it once with known $σ$ and once with an estimated one. The confidence intervals have to be constructed manually. Provide their interpretation.
#Calculating confidence margin for each n
confidence_dif = []
for i in range(len(list_of_n)):
    var = variances[i]
    n = list_of_n[i]
    dif = Z_1_alfa2*np.sqrt(var/n)
    confidence_dif.append(dif)
Let the "confidence_mean_upper" values be $\bar{X_i}+z_{1- \frac{\alpha}{2}}\frac{\sigma}{\sqrt{n_i}}$ and the "confidence_mean_lower" values be $\bar{X_i}-z_{1- \frac{\alpha}{2}}\frac{\sigma}{\sqrt{n_i}}$
data['confidence_dif'] = confidence_dif
data['confidence_mean_upper'] = data['mean'] + data['confidence_dif']
data['confidence_mean_lower'] = data['mean'] - data['confidence_dif']
data[['n','mean', 'confidence_mean_upper', 'confidence_mean_lower', 'confidence_dif']].transpose()
line = alt.Chart(data).mark_line().encode(
x=alt.X('n:O', ),
y=alt.Y('mean:Q',title ="mean", scale = alt.Scale(domain=[0.8,1.05])),
tooltip = alt.Tooltip(['n:O', 'mean:Q'])
)
line2 = alt.Chart(data).mark_line(color='#d53e4f').encode(
x=alt.X('n:O', ),
y=alt.Y('1 value:Q',title ="", scale = alt.Scale(domain=[0.8,1.05]))
)
confidence = alt.Chart(data).mark_area(opacity=0.4).encode(
x=alt.X('n:O', ),
y=alt.Y('confidence_mean_upper:Q', scale = alt.Scale(domain=[0.8,1.25])),
y2=alt.Y2('confidence_mean_lower:Q')
)
alt.layer(confidence + line + line2).configure_axisX(
labelAngle=0,
labelPadding=8
).properties(width=900, title="The confidence intervals for mean values for different n")
As we can see, the larger n is, the narrower the confidence interval becomes.
(d) Next plot the sample variance as a function of n. If we think about the consistency of the sample variance as an estimator for $σ^2$, does the figure support this property? Give the interpretation of consistency in your own words.
mu, sigma = 1, 1 # mean and standard deviation
list_of_n_all_values = list(range(100,100001,100))
new_samples = [np.random.normal(mu, sigma, n) for n in list_of_n_all_values]
new_variances = [sample.var() for sample in new_samples]
new_data = pd.DataFrame({'n':list_of_n_all_values, 'variance': new_variances})
new_data.transpose()
new_data['1 value'] = [1]*len(list_of_n_all_values)
line = alt.Chart(new_data).mark_line(color="green").encode(
x=alt.X('n:Q', ),
y=alt.Y('variance:Q',title ="variance", scale = alt.Scale(domain=[0.85,1.2])),
tooltip = alt.Tooltip(['n:Q', 'variance:Q'])
)
line2 = alt.Chart(new_data).mark_line(color='#d53e4f').encode(
x=alt.X('n:Q', ),
y=alt.Y('1 value:Q',title ="", scale = alt.Scale(domain=[0.85,1.2]))
)
alt.layer(line + line2).configure_axisX(
labelAngle=0,
labelPadding=8
).properties(width=900, title="The values of variance for different n")
As we can see from the plot, the estimated variances converge to the theoretical variance as n increases.
Consistency in my own words:
The probability that the absolute difference between the estimated and the theoretical parameter exceeds some constant $c>0$ goes to $0$ as $n$ increases.
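This definition can be illustrated by simulation. A minimal sketch (`exceed_prob` is a helper introduced here) showing that $P(|s^2-\sigma^2|\geq c)$ shrinks as $n$ grows:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2, c, runs = 1.0, 0.1, 300

def exceed_prob(n):
    """Fraction of runs where the sample variance misses sigma^2 by >= c."""
    sample_vars = rng.normal(1.0, 1.0, size=(runs, n)).var(axis=1)
    return np.mean(np.abs(sample_vars - sigma2) >= c)

# Consistency: the exceedance probability should drop towards 0.
print(exceed_prob(100), exceed_prob(10000))
```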
(a) Let x1,...,xn be a given sample. We assume that it stems from a t-distribution with an unknown number of degrees of freedom. Write down the corresponding log-likelihood function. The density function of the $t_{df}$ -distribution is given by
$$f(x) = \frac{(1+\frac{x^2}{df})^{- \frac{df+1}{2}}}{B(\frac{df}{2},\frac{1}{2})\sqrt{df}}$$
where $B(·, ·)$ is the beta function $(beta(a,b)$ in $R.)$
Likelihood function:
$$L(df) = \prod_{i=1}^{n} f(x_i) = \prod_{i=1}^{n} \frac{(1+\frac{x_i^2}{df})^{- \frac{df+1}{2}}}{B(\frac{df}{2},\frac{1}{2})\sqrt{df}}$$
Log-likelihood function:
$$\ell(df) = \log L(df) = -\frac{df+1}{2}\sum_{i=1}^{n}\log\left(1+\frac{x_i^2}{df}\right) - n\log\left(B\left(\tfrac{df}{2},\tfrac{1}{2}\right)\sqrt{df}\right)$$
Or, written directly in code:
import scipy.special
def f(x, df):
    return (np.power(1 + np.power(x, 2)/df, -(df + 1)/2.))/(scipy.special.beta(df/2., 0.5)*np.sqrt(df))
def LogLikelihood(df, x):
    return np.sum(np.log(f(x, df)))
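As a quick sanity check, the hand-written density can be compared against `scipy.stats.t.pdf`; the two should agree up to floating-point error (a self-contained sketch, redefining `f` locally):

```python
import numpy as np
import scipy.special
import scipy.stats as sts

def f(x, df):
    return (np.power(1 + np.power(x, 2) / df, -(df + 1) / 2.)
            / (scipy.special.beta(df / 2., 0.5) * np.sqrt(df)))

x = np.linspace(-5, 5, 101)
# The custom density matches SciPy's t-density for df = 5.
print(np.allclose(f(x, 5), sts.t.pdf(x, df=5)))  # True
```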
(b) Simulate a sample of size $n = 100$ from $t_5$. Maximize the log-likelihood function (numerically) for the given sample and obtain the ML estimator of the number of degrees of freedom. Compare the estimator with the true value.
import scipy.stats as sts
n=100
sample = sts.t.rvs(df=5, size=n)
source = pd.DataFrame({'sample': sample})
alt.Chart(source).mark_area(
opacity=0.9,
interpolate='step'
).encode(
alt.X('sample:Q',bin=alt.Bin(maxbins=25)),
alt.Y('count()')
).properties(width=600)
! SciPy's standard optimizers only minimize, so to maximize the log-likelihood we minimize its negative (i.e. flip the sign of the output).
def LL_for_optimize(df, x=sample):
    return -LogLikelihood(df, x)
LogLikelihood(df=5, x=sample)
LL_for_optimize(df=5, x=sample)
import scipy.optimize
scipy.optimize.minimize(LL_for_optimize, [5])
$\Rightarrow$ The estimator gives a value of about 4, while the true value is 5.
(c) Increase the sample to $n = 5000$ and compare the new estimator with the true value. Which property of the estimator we expect to observe?
n=5000
new_sample = sts.t.rvs(df=5, size=n)
source = pd.DataFrame({'sample': new_sample})
alt.Chart(source).mark_area(
opacity=0.99,
interpolate='step'
).encode(
alt.X('sample:Q',bin=alt.Bin(maxbins=100)),
alt.Y('count()')
).properties(width=600)
def LL_for_optimize(df, x=new_sample):
    return -LogLikelihood(df, x)
scipy.optimize.minimize(LL_for_optimize, [5])
For $n=5000$ the estimator gives a value of about $5$, which matches the true value. This is the consistency of the ML estimator.
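The result can also be cross-checked with SciPy's built-in maximum-likelihood fit. A sketch (the `floc=0, fscale=1` arguments pin location and scale so that only df is estimated; the exact estimate depends on the random sample):

```python
import scipy.stats as sts

check_sample = sts.t.rvs(df=5, size=5000, random_state=42)
df_hat, loc, scale = sts.t.fit(check_sample, floc=0, fscale=1)
print(df_hat)  # close to the true value 5
```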
(a) Simulate $b = 1000$ samples of size $n = 100$ from a $χ2$ distribution with mean 1 and variance $1$. For each sample estimate the mean, the variance and keep them. Plot the histogram for one of the sample, so that you get a better feeling how the original distribution looks like.
Simulating samples
b=1000
n=100
df =2
samples = []
for i in range(b):
    sample = sts.chi2.rvs(df, size=n)
    samples.append(sample)
mean = np.array(samples).mean()
mean
variance = np.array(samples).var()
variance
rescaled_samples = (np.array(samples) - mean)/np.sqrt(variance) + 1
rescaled_samples.mean()
rescaled_samples.var()
means = []
variances = []
for i in range(b):
    means.append(rescaled_samples[i].mean())
    variances.append(rescaled_samples[i].var())
data = pd.DataFrame({'mean':means, 'variance': variances})
data.transpose()
sample = rescaled_samples[5]
source = pd.DataFrame({'sample': sample})
alt.Chart(source).mark_area(
opacity=0.99,
interpolate='step'
).encode(
alt.X('sample:Q',bin=alt.Bin(maxbins=20)),
alt.Y('count()')
).properties(width=600)
(b) Plot the KDE or a histogram for the sample of means and the sample of variances. Add the density of the normal distribution for comparison purposes. Compare the density estimator with the normal density. What do you expect and why (statistical reasoning!)?
mu = np.array(means).mean()
sigma = np.array(means).std()
normal_distribution = np.random.normal(mu, sigma, b)
data['normal_d'] = normal_distribution
data.transpose()
means_plot = alt.Chart(data).mark_area(
color="blue",
opacity=0.5,
interpolate='step'
).encode(
alt.X('mean:Q',bin=alt.Bin(maxbins=20)),
alt.Y('count()')
)
normal_dist_plot = alt.Chart(data).mark_area(
color="red",
opacity=0.5,
interpolate='step'
).encode(
alt.X('normal_d:Q',bin=alt.Bin(maxbins=20)),
alt.Y('count()')
)
alt.layer(normal_dist_plot+means_plot).properties(width=600, title="Compare means with normal distribution")
mu = np.array(variances).mean()
sigma = np.array(variances).std()
normal_distribution = np.random.normal(mu, sigma, b)
data['normal_d'] = normal_distribution
data.transpose()
variance_plot = alt.Chart(data).mark_area(
color="green",
opacity=0.5,
interpolate='step'
).encode(
alt.X('variance:Q',bin=alt.Bin(maxbins=20)),
alt.Y('count()')
)
normal_dist_plot = alt.Chart(data).mark_area(
color="red",
opacity=0.5,
interpolate='step'
).encode(
alt.X('normal_d:Q',bin=alt.Bin(maxbins=20)),
alt.Y('count()')
)
alt.layer(normal_dist_plot+variance_plot).properties(width=600, title="Compare variances with normal distribution")
(c) In the lecture we discussed the CLT (Central Limit Theorem) for the sample mean. Here it seems to apply to the sample variance too. Why?
Central Limit Theorem:
$$\sqrt{n}\,\frac{\bar{X}_n - \mu}{\sigma} \xrightarrow{d} N(0, 1) \quad \text{as } n \rightarrow \infty$$
A similar result holds for the sample variance: up to a term that vanishes as $n$ grows, $s^2$ is the average of the i.i.d. random variables $(X_i - \mu)^2$, so the CLT applies to it as well.
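A simulation sketch of this point: the standardized sample variances of many chi-square samples should look approximately standard normal (the names here are introduced for illustration and do not reuse variables from the cells above):

```python
import numpy as np
import scipy.stats as sts

b, n = 1000, 1000
samples = sts.chi2.rvs(df=2, size=(b, n), random_state=3)
s2 = samples.var(axis=1)                 # one sample variance per row
z = (s2 - s2.mean()) / s2.std()          # standardize across the b samples
stat, p = sts.kstest(z, 'norm')          # compare with N(0, 1)
print(p)  # typically well above 0.05: normality is not rejected
```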
(d) Let $n$ take values $10^3,10^4,10^5$ and $10^6$. Check the impact of $n$ on the results. Can the statement of the CLT be confirmed?
b=1000
df=2
n=1000
samples = []
for i in range(b):
    sample = sts.chi2.rvs(df, size=n)
    samples.append(sample)
# Recompute the overall mean and variance for this n before rescaling
mean = np.array(samples).mean()
variance = np.array(samples).var()
rescaled_samples = (np.array(samples) - mean)/np.sqrt(variance) + 1
means = []
variances = []
for i in range(b):
    means.append(rescaled_samples[i].mean())
    variances.append(rescaled_samples[i].var())
sample = rescaled_samples[5]
source = pd.DataFrame({'sample': sample})
alt.Chart(source).mark_area(
opacity=0.99,
interpolate='step'
).encode(
alt.X('sample:Q',bin=alt.Bin(maxbins=20)),
alt.Y('count()')
).properties(width=600)
mu = np.array(means).mean()
sigma = np.array(means).std()
normal_distribution = np.random.normal(mu, sigma, b)
data = pd.DataFrame({'mean':means, 'variance': variances})
data['normal_d'] = normal_distribution
means_plot = alt.Chart(data).mark_area(
color="blue",
opacity=0.5,
interpolate='step'
).encode(
alt.X('mean:Q',bin=alt.Bin(maxbins=20)),
alt.Y('count()')
)
normal_dist_plot = alt.Chart(data).mark_area(
color="red",
opacity=0.5,
interpolate='step'
).encode(
alt.X('normal_d:Q',bin=alt.Bin(maxbins=20)),
alt.Y('count()')
)
alt.layer(normal_dist_plot+means_plot).properties(width=600, title="Compare means with normal distribution")
mu = np.array(variances).mean()
sigma = np.array(variances).std()
normal_distribution = np.random.normal(mu, sigma, b)
data['normal_d'] = normal_distribution
variance_plot = alt.Chart(data).mark_area(
color="green",
opacity=0.5,
interpolate='step'
).encode(
alt.X('variance:Q',bin=alt.Bin(maxbins=20)),
alt.Y('count()')
)
normal_dist_plot = alt.Chart(data).mark_area(
color="red",
opacity=0.5,
interpolate='step'
).encode(
alt.X('normal_d:Q',bin=alt.Bin(maxbins=20)),
alt.Y('count()')
)
alt.layer(normal_dist_plot+variance_plot).properties(width=600, title="Compare variances with normal distribution")
alt.data_transformers.enable('default', max_rows=None)
b=1000
df=2
n=10000
samples = []
for i in range(b):
    sample = sts.chi2.rvs(df, size=n)
    samples.append(sample)
# Recompute the overall mean and variance for this n before rescaling
mean = np.array(samples).mean()
variance = np.array(samples).var()
rescaled_samples = (np.array(samples) - mean)/np.sqrt(variance) + 1
means = []
variances = []
for i in range(b):
    means.append(rescaled_samples[i].mean())
    variances.append(rescaled_samples[i].var())
sample = rescaled_samples[5]
source = pd.DataFrame({'sample': sample})
alt.Chart(source).mark_area(
opacity=0.99,
interpolate='step'
).encode(
alt.X('sample:Q',bin=alt.Bin(maxbins=20)),
alt.Y('count()')
).properties(width=600)
mu = np.array(means).mean()
sigma = np.array(means).std()
normal_distribution = np.random.normal(mu, sigma, b)
data = pd.DataFrame({'mean':means, 'variance': variances})
data['normal_d'] = normal_distribution
means_plot = alt.Chart(data).mark_area(
color="blue",
opacity=0.5,
interpolate='step'
).encode(
alt.X('mean:Q',bin=alt.Bin(maxbins=20)),
alt.Y('count()')
)
normal_dist_plot = alt.Chart(data).mark_area(
color="red",
opacity=0.5,
interpolate='step'
).encode(
alt.X('normal_d:Q',bin=alt.Bin(maxbins=20)),
alt.Y('count()')
)
alt.layer(normal_dist_plot+means_plot).properties(width=600, title="Compare means with normal distribution")
mu = np.array(variances).mean()
sigma = np.array(variances).std()
normal_distribution = np.random.normal(mu, sigma, b)
data['normal_d'] = normal_distribution
variance_plot = alt.Chart(data).mark_area(
color="green",
opacity=0.5,
interpolate='step'
).encode(
alt.X('variance:Q',bin=alt.Bin(maxbins=20)),
alt.Y('count()')
)
normal_dist_plot = alt.Chart(data).mark_area(
color="red",
opacity=0.5,
interpolate='step'
).encode(
alt.X('normal_d:Q',bin=alt.Bin(maxbins=20)),
alt.Y('count()')
)
alt.layer(normal_dist_plot+variance_plot).properties(width=600, title="Compare variances with normal distribution")
b=1000
df=2
n=100000
samples = []
for i in range(b):
    sample = sts.chi2.rvs(df, size=n)
    samples.append(sample)
# Recompute the overall mean and variance for this n before rescaling
mean = np.array(samples).mean()
variance = np.array(samples).var()
rescaled_samples = (np.array(samples) - mean)/np.sqrt(variance) + 1
means = []
variances = []
for i in range(b):
    means.append(rescaled_samples[i].mean())
    variances.append(rescaled_samples[i].var())
sample = rescaled_samples[5]
source = pd.DataFrame({'sample': sample})
alt.Chart(source).mark_area(
opacity=0.99,
interpolate='step'
).encode(
alt.X('sample:Q',bin=alt.Bin(maxbins=20)),
alt.Y('count()')
).properties(width=600)
mu = np.array(means).mean()
sigma = np.array(means).std()
normal_distribution = np.random.normal(mu, sigma, b)
data = pd.DataFrame({'mean':means, 'variance': variances})
data['normal_d'] = normal_distribution
means_plot = alt.Chart(data).mark_area(
color="blue",
opacity=0.5,
interpolate='step'
).encode(
alt.X('mean:Q',bin=alt.Bin(maxbins=20)),
alt.Y('count()')
)
normal_dist_plot = alt.Chart(data).mark_area(
color="red",
opacity=0.5,
interpolate='step'
).encode(
alt.X('normal_d:Q',bin=alt.Bin(maxbins=20)),
alt.Y('count()')
)
alt.layer(normal_dist_plot+means_plot).properties(width=600, title="Compare means with normal distribution")
mu = np.array(variances).mean()
sigma = np.array(variances).std()
normal_distribution = np.random.normal(mu, sigma, b)
data['normal_d'] = normal_distribution
variance_plot = alt.Chart(data).mark_area(
color="green",
opacity=0.5,
interpolate='step'
).encode(
alt.X('variance:Q',bin=alt.Bin(maxbins=20)),
alt.Y('count()')
)
normal_dist_plot = alt.Chart(data).mark_area(
color="red",
opacity=0.5,
interpolate='step'
).encode(
alt.X('normal_d:Q',bin=alt.Bin(maxbins=20)),
alt.Y('count()')
)
alt.layer(variance_plot + normal_dist_plot).properties(width=600, title="Compare variances with normal distribution")
As we see, with increasing n the distributions of the sample means and the sample variances are approximated better and better by the normal distribution.
(a) Simulate a sample of length $n = 100$ from a normal distribution with mean $μ_0 = 500$ and variance $σ^2 = 50$. (Note: you may use the transformation $X = μ + σZ$, where $Z ∼ N(0,1)$.) The objective is to test the null hypothesis $H_0 : μ = 500$. Assume that $σ^2$ has to be estimated. Compute the test statistic using the formulas in the lecture; determine the rejection area for $α = 0.04$ and decide if $H_0$ can be rejected.
import scipy.stats as sts
np.random.seed(1)
n=100
mu = 500
variance = 50
sigma = np.sqrt(variance)
normal_distribution = np.random.normal(mu, sigma, n)
source = pd.DataFrame({'sample': normal_distribution})
alt.Chart(source).mark_area(
opacity=0.99,
interpolate='step'
).encode(
alt.X('sample:Q',bin=alt.Bin(maxbins=10)),
alt.Y('count()')
).properties(width=600)
Two-sided test
$H_0: \mu_0=500$
$H_1: \mu_0 \neq 500$
$X \sim N(\mu, \sigma^2)$ - we know that our sample comes from a normal distribution
$\sigma^2$ is estimated by $s^2 \Rightarrow \sigma^2$ is unknown $\Rightarrow$ we should use the statistic $V=\sqrt{n}\frac{\bar{X}-\mu_0}{s} \sim t_{n-1}$
mean = normal_distribution.mean()
mean
std = normal_distribution.std()
std
v = np.sqrt(n)*(mean - mu)/std
v
$H_0$ is rejected if $|v|>t_{n-1,1-\frac{\alpha}{2}}$ (the quantile comes from the $t_{n-1}$ distribution, since $\sigma$ is estimated)
$\alpha = 0.04~~\Rightarrow 1-\frac{\alpha}{2} = 0.98$
t_1_alfa2 = sts.t.ppf(0.98, n-1)
t_1_alfa2
np.abs(v) > t_1_alfa2
$\Rightarrow |v| \ngtr t_{n-1,1-\frac{\alpha}{2}}\Rightarrow~~ H_0$ is not rejected
(b) Determine the $p$-values using the formulas from the lecture and compare/check the results using a built-in function for this test in R or Python. Give a verbal interpretation of the obtained $p$-value.
$p$-value is the largest level of significance for which $H_0$ is still not rejected
The smaller the $p$-value is, the more evidence the sample contains against $H_0$
$\alpha > p \Rightarrow H_0$ rejected
$\alpha \leq p \Rightarrow H_0$ not rejected
Case: two-sided t-test
$F_{t_{n-1}}(v) = \frac{p}{2}$ if $v<0$
$F_{t_{n-1}}(v) = 1 - \frac{p}{2}$ if $v>0$
$\Rightarrow p=2\left(1 - F_{t_{n-1}}(|v|)\right)~~$ if $~~H_1: \mu \neq \mu_0$
p = 2*(1- sts.t.cdf(np.abs(v),n-1))
print('T-statistics: ',v)
print('\np-value: ', p)
$\alpha = 0.04$
$\alpha \leq p \Rightarrow H_0$ not rejected
sts.ttest_1samp(normal_distribution, mu)
v_val, p_val = sts.ttest_1samp(normal_distribution, mu)
print('T-statistics: ',v_val)
print('\np-value: ', p_val)
As we see, p-values are almost the same
(c) Simulate $M = 1000$ samples of size $n = 100$ and with $μ_0 = 500$ and variance $σ^2 = 50$.
For each sample $i$ run the test (using a standard function) and set $p_i = 0$ if $H_0$ is not rejected and $p_i = 1$ if it is rejected. Compute $\hat{α} = \frac{1}{M}\sum_{i=1}^M p_i$.
$\hat{α}$ is the empirical confidence level (empirical size) of the test. Compare $\hat{α}$ with $α$. Do you expect the difference to be large or small and why? Relate it to the assumptions of the test.
My expectation: they should be roughly the same.
$\hat{α}$ is an estimate of the theoretical value $\alpha$.
$\alpha$ is the probability of a type I error:
$P(\text{reject } H_0 \mid H_0 \text{ true}) = \alpha$
So we count how often the test makes a type I error.
M=1000
n=100
mu = 500
variance = 50
sigma = np.sqrt(variance)
alpha = 0.04
np.random.seed(4)
samples = [np.random.normal(mu, sigma, n) for i in range(M)]
results = []
for sample in samples:
    v, p_val = sts.ttest_1samp(sample, mu)
    if alpha > p_val:
        results.append(1)
    else:
        results.append(0)
alpha_hat = 1./M * sum(results)
alpha_hat
As we see, $\hat{α}$ is very close to $α$, as expected when the assumptions of the test hold.
(d) Assume now that one of the assumptions is not satisfied. For example, the data is in fact not normal. Repeat the above analysis, but simulate a sample $z_1,...,z_n$ from $t$-distribution with 3 degrees of freedom. Compute $x_i = 500 + \sqrt{50}\cdot\frac{z_i}{\sqrt{3}}$.
(Note: This will guarantee the same expectation and the same variance as the above normal distribution.) Recompute a new $\hat{α}$. What do you expect? Relate your answer to the type I error and the underlying assumptions.
After this rescaling the samples from the Student's $t$-distribution have the same mean $\mu = 500$ and variance $\sigma^2=50$ as the normal samples above, but the distribution itself is not normal (it has heavier tails).
That is why the estimated $\hat{α}$ should still be close to $α$, possibly with a somewhat larger deviation, since the normality assumption of the t-test is violated.
np.random.seed(0)
n=100
mu = 500
variance = 50
sigma = np.sqrt(variance)
t_distribution = sts.t.rvs(df=3, size=n)
rescaled_distribution = 500 + np.sqrt(50/3)*t_distribution
source = pd.DataFrame({'rescaled_distribution': rescaled_distribution})
alt.Chart(source).mark_area(
opacity=0.99,
interpolate='step'
).encode(
alt.X('rescaled_distribution:Q',bin=alt.Bin(maxbins=10)),
alt.Y('count()')
).properties(width=600)
M=1000
n=100
mu = 500
variance = 50
sigma = np.sqrt(variance)
alpha = 0.04
samples = [sts.t.rvs(df=3, size=n) for i in range(M)]
rescaled_samples = 500 + np.sqrt(50/3)*np.array(samples)
results = []
for sample in rescaled_samples:
    v, p_val = sts.ttest_1samp(sample, mu)
    if alpha > p_val:
        results.append(1)
    else:
        results.append(0)
alpha_hat = 1./M * sum(results)
alpha_hat
As we see, $\hat{α}$ and $α$ are still very similar: the t-test is fairly robust against this deviation from normality.
(e) Power of a test: The first objective is to assess the probability of type 2 error (power of a test) of goodness-of-fit test. Goodness-of-fit tests for the normal distribution are of key importance in statistics, since they allow to verify the distributional assumptions required in many models. Here we check the power of the Kolmogorov-Smirnov test, i.e. is the test capable to detect deviations from normality?
Simulate $M = 1000$ samples of size $100$ from a $t$-distribution with $df = 2,....,50$ degrees of freedom. For each sample run the Kolmogorov-Smirnov test and count the cases when the $H_0$ of normality is correctly rejected (for each df). How would you use this quantity to estimate the power of the test? Make an appropriate plot with the df on the X-axis. (Note: the $t$-distribution converges to the normal distribution as $df$ tends to infinity. For $df > 50$ the distributions are almost identical.) Discuss the plot and draw conclusions about the reliability of the test.
Let $\alpha=0.05$
M=1000
n=100
alpha = 0.05
result_sums = []
for df in range(2, 51):
    samples = [sts.t.rvs(df=df, size=n) for i in range(M)]
    sum_of_rejected = 0
    for sample in samples:
        statistics_value, p_value = sts.kstest(sample, 'norm')
        if alpha > p_value:
            sum_of_rejected += 1
    result_sums.append(sum_of_rejected)
source = pd.DataFrame({'df':list(range(2, 51)), 'sum_of_rejected':result_sums})
line = alt.Chart(source).mark_line(point=True).encode(
x='df:O',
y='sum_of_rejected:Q',
tooltip = alt.Tooltip(['df:O', 'sum_of_rejected:Q'])
)
line.properties(width=900).configure_axisX(
labelAngle=0,
labelPadding=8
)
Power of the test $= 1-\beta$
$\beta$ is the probability of a type II error, in our case: not rejecting $H_0$ (that the distribution is normal) when the distribution is in fact not normal.
For small values of df the t-distribution is very different from the normal distribution, which is why the power is high.
For large values of df the t-distribution looks very much like the normal distribution, which is why the power of the test is low.
So the plot shows how the power of the test changes for different degrees of freedom df: the test is reliable only for detecting strong deviations from normality.
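The rejection counts can be turned directly into an estimated power. A self-contained sketch (`estimated_power` is a helper introduced here; a smaller M keeps it fast):

```python
import scipy.stats as sts

def estimated_power(df, n=100, M=200, alpha=0.05, seed=0):
    """Fraction of simulated t-samples for which KS rejects normality."""
    rejected = 0
    for i in range(M):
        sample = sts.t.rvs(df=df, size=n, random_state=seed + i)
        _, p_value = sts.kstest(sample, 'norm')
        if p_value < alpha:
            rejected += 1
    return rejected / M

# High power for heavy-tailed t (df=2), low power near normality (df=50).
print(estimated_power(2), estimated_power(50))
```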